On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets
Current research on lifelog data has paid far less attention to the analysis of
cognitive activities than to physical activities. We argue that, looking to the
future, wearable devices will become cheaper and more prevalent, and textual
data will play a more significant role. Data captured by lifelogging devices
will increasingly include speech and text, which are potentially useful for the
analysis of intellectual activities. By analyzing what a person hears, reads,
and sees, we should be able to measure the extent of cognitive activity a
learner devotes to a given topic or subject. Text-based lifelog records can
benefit from semantic analysis tools developed for natural language
processing. We show how semantic analysis of such text data can be achieved
through the use of taxonomic subject facets, and how these facets might be
used to quantify the cognitive activity devoted to various topics in a
person's day. We are currently developing a method to automatically create
taxonomic topic vocabularies that can be applied to this detection of
intellectual activity.
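As a minimal sketch of the facet-based quantification described above: given a taxonomic facet vocabulary, cognitive activity per topic can be approximated by counting facet-term occurrences in a day's lifelog text. The facet vocabularies below are invented placeholders, not the automatically induced taxonomies from the paper.

```python
import re
from collections import Counter

# Hypothetical facet vocabularies; in the paper these would be the
# automatically created taxonomic topic vocabularies.
FACETS = {
    "programming": {"python", "compiler", "debugging", "function"},
    "cooking": {"recipe", "oven", "saute", "simmer"},
}

def facet_activity(text):
    """Count the tokens of a lifelog text that fall under each facet,
    as a rough proxy for cognitive activity devoted to that topic."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for facet, vocab in FACETS.items():
            if token in vocab:
                counts[facet] += 1
    return counts
```

A real system would of course need lemmatization and disambiguation; the point here is only the facet-as-counter mechanism.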
Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler
Specialized dictionaries are used to understand concepts in specific domains,
especially where those concepts are not part of the general vocabulary or have
meanings that differ from ordinary language. The first step in creating a
specialized dictionary is detecting the characteristic vocabulary of the
domain in question. Classical methods for detecting this vocabulary involve
gathering a domain corpus, calculating statistics on the terms found there,
and then comparing these statistics to a background or general-language
corpus. Terms found significantly more often in the specialized corpus than
in the background corpus are candidates for the characteristic vocabulary of
the domain. Here we present two tools, a directed crawler and a distributional
semantics package, that can be used together, circumventing the need for a
background corpus. Both tools are available on the web.
Wiring Kenyan Languages for the Global Virtual Age: An audit of the Human Language Technology Resources
Whereas we recognize the advancement of computing and internet technologies over the years and their impact in the areas of health, education, government, etc.
KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language
The need for Question Answering datasets in low-resource languages is the
motivation of this research, leading to the development of the Kencorpus
Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from
raw story texts in Swahili, a low-resource language spoken predominantly in
Eastern Africa and in other parts of the world. Question Answering (QA)
datasets are important for machine comprehension of natural language, for
tasks such as internet search and dialog systems. Machine learning systems
need training data such as the gold-standard Question Answering set developed
in this research. The research engaged annotators to formulate QA pairs from
Swahili texts collected by the Kencorpus project, a Kenyan languages corpus.
The project annotated 1,445 of the 2,585 total texts with at least 5 QA pairs
each, resulting in a final dataset of 7,526 QA pairs. A quality-assurance pass
over 12.5% of the annotated texts confirmed that the QA pairs were all
correctly annotated. A proof of concept applying the set to the QA task
confirmed that the dataset is usable for such tasks. KenSwQuAD has also
contributed to the resourcing of the Swahili language.
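To make the annotation product concrete, a gold-standard QA record of this kind is often laid out in SQuAD-style JSON, with answer spans anchored by character offsets into the story text. The field names and the sample Swahili text below follow the common SQuAD convention and are an assumption, not the published KenSwQuAD schema.

```python
# A hypothetical SQuAD-style record for one annotated story text.
record = {
    "title": "hadithi-001",  # invented story identifier
    "paragraphs": [{
        "context": "Mfano wa hadithi fupi ya Kiswahili.",  # story text
        "qas": [{
            "id": "hadithi-001-q1",
            "question": "Hadithi hii inahusu nini?",
            "answers": [{"text": "Mfano", "answer_start": 0}],
        }],
    }],
}

def validate(rec):
    """Quality-assurance check: each answer span must actually occur at
    its stated character offset in the context."""
    for para in rec["paragraphs"]:
        for qa in para["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]
                span = para["context"][start:start + len(ans["text"])]
                assert span == ans["text"], f"bad span in {qa['id']}"

validate(record)
```

A check like `validate` is the kind of mechanical test a 12.5% quality-assurance sample would complement with human review.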
Word Embedding and Statistical Based Methods for Rapid Induction of Multiple Taxonomies
In this paper we present two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first method uses sentence-level word co-occurrence frequencies to build the taxonomy, while the second bootstraps a Word2Vec-based algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content directory of the World Wide Web, to seed the crawl, and the domain name to direct the crawl. The resulting domain corpus is then input to our algorithm, which can automatically induce taxonomies. The induced taxonomies provide hierarchical semantic dimensions for the purposes of faceted browsing. As part of an ongoing personal-semantics project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook, Instagram, Flickr) with the objective of enhancing an individual's exploration of their personal information through faceted searching. We also perform a comprehensive corpus-based evaluation of the algorithms on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies), and show that the induced taxonomies are of high quality.
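The sentence-level co-occurrence approach can be illustrated with a subsumption-style heuristic: a term is proposed as the parent of another when it appears in most sentences containing the other term, but not vice versa. This toy scoring captures the spirit of co-occurrence-based induction; the paper's actual frequency method may differ.

```python
from collections import defaultdict
from itertools import combinations

def induce_hierarchy(sentences, threshold=0.8):
    """Propose parent->child edges from sentence-level co-occurrence:
    x subsumes y when P(x | y) >= threshold but P(y | x) < threshold."""
    occurs = defaultdict(set)  # term -> set of sentence indices
    for i, sent in enumerate(sentences):
        for tok in set(sent.lower().split()):
            occurs[tok].add(i)
    edges = []
    for x, y in combinations(occurs, 2):
        both = len(occurs[x] & occurs[y])
        if not both:
            continue
        p_x_given_y = both / len(occurs[y])
        p_y_given_x = both / len(occurs[x])
        if p_x_given_y >= threshold and p_y_given_x < threshold:
            edges.append((x, y))   # x is the broader term
        elif p_y_given_x >= threshold and p_x_given_y < threshold:
            edges.append((y, x))
    return edges

edges = induce_hierarchy([
    "disease malaria",
    "disease cholera",
    "disease malaria symptoms",
    "healthy diet",
])
```

On this toy corpus, "disease" co-occurs with every mention of "malaria" and "cholera" but not conversely, so it is proposed as their parent; "healthy" and "diet" always co-occur symmetrically, so neither subsumes the other.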
Extracting Hierarchical Topic Models from the Web for Improving Digital Archive Access
Topic models provide a weighted list of terms specific to a given domain. For example, the terminology for painting, as a hobby, might include specific tools used in painting, such as brush, easel, and canvas, as well as more specific terms such as common oil colors: deep aquamarine, cerulean blue, zinc white. For clothing, a topic model should include words such as shoes, boots, socks, skirts, and hats, as well as more specific terms such as tennis shoes, cocktail dress, and specific brands of shoes, hats, shirts, etc. In addition to containing the characteristic terms of a topic, a topic model also contains the relative frequency of each term's use in the topic text. This frequency is useful in information retrieval settings: when a large number of results are returned for a query, they can be ordered by pertinence, using the relative frequency of domain words to rank the responses. Providing a hierarchical topic model also allows an information retrieval application to create facets (Tunkelang, 2009), or categories appearing in the result sets, with which the user can filter results, as on an online shopping site. One problem for many information retrieval platforms in digital humanities archives is the lack of topic models other than those already foreseen and implemented when the archive was first digitized. A researcher wishing to look at a collection or archive from a new angle has no means of exploiting a new topic model corresponding to his or her axis of research. This obstacle has two causes: (1) technologically, the platform has to allow re-annotation of the underlying archive with a new topic model. This technological problem is solvable by implementing a suite of natural language processing tools that can access the textual description of elements in the archive and identify there the terms from a new topic model.
For example, the commonly used information retrieval platform Lucene (Grainger, 2014) allows the administrator to add new facet annotations to existing documents. A second, more difficult problem is (2) building a new topic model. When done manually, this is a time-consuming task, with no assurance of being complete or adequate, unless great expense is incurred, as is the case for MeSH, the medical subject heading taxonomy (Coletti and Bleich, 2001), for which regular monthly meetings are held to maintain and update the terminology. For subjects less important to society, few such ontological resources exist. When topic models are created automatically, they can homogenize existing terminology (Newman et al., 2007) but often produce noise (Steyvers et al., 2004) that may seem excessive to some archivists.
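The relative-frequency ranking described above can be sketched directly: each document in a result set is scored by the summed weights of the topic-model terms it contains. The painting topic model and its weights below are illustrative placeholders, not figures from the paper.

```python
def rank_by_topic(results, topic_weights):
    """Order retrieved documents by the summed relative frequency of
    topic-model terms they contain (a simple pertinence score)."""
    def score(doc):
        return sum(topic_weights.get(t, 0.0) for t in doc.lower().split())
    return sorted(results, key=score, reverse=True)

# Hypothetical weighted term list for a "painting" topic.
painting = {"brush": 0.05, "easel": 0.02, "canvas": 0.04, "cerulean": 0.01}

ranked = rank_by_topic(
    ["a walk in the park",
     "a canvas and a brush on the easel",
     "cerulean blue paint"],
    painting,
)
```

In a platform like Lucene, the same idea would be expressed as a boost on facet terms rather than a post-hoc re-sort, but the scoring principle is identical.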